Okay so we have the three trees, rooted on rh1988 and ag342.

Things to note here are that the following species get pretty much 0 orthologs. I think a run without them is needed. This will increase the orthologs identified considerably.

And here are the SCO counts at each percentage-of-species threshold.

90% - greater 72 species, SCOs is 0.
85% - greater 68 species, SCOs is 2.
80% - greater 64 species, SCOs is 52.
75% - greater 60 species, SCOs is 164.
70% - greater 56 species, SCOs is 255.
65% - greater 52 species, SCOs is 308.
60% - greater 48 species, SCOs is 355.
55% - greater 44 species, SCOs is 419.
50% - greater 40 species, SCOs is 488.

All Samples

Greater 80% Orthologs

So orthologs > 80% did not work very well. In the RAxML output it consistently could not resolve the relationship between the outgroups (as can be seen from the chaos at the bottom of the plot).

Greater 75% Orthologs

Greater 70% Orthologs

Removing the Low Orthogroup Species

So this is with the following species removed due to having pretty much no orthogroups.

The number of orthogroups identified in this analysis is much better (scroll to the top to compare with the other analysis).

100% SCO is 1.
90% - greater 61 species, SCOs is 143.
85% - greater 58 species, SCOs is 227.
80% - greater 54 species, SCOs is 285.
75% - greater 51 species, SCOs is 314.
70% - greater 48 species, SCOs is 347.
65% - greater 44 species, SCOs is 416.
60% - greater 41 species, SCOs is 455.
55% - greater 37 species, SCOs is 560.
50% - greater 34 species, SCOs is 622.

I ran this for 90%, 85%, 80%, and 75% and rooted on

Greater 90% Removed Species

So the tree file has 68 species and the metadata has 94 species.

ggtree(g90rem_data, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = label, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 14) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)

Greater 85% Removed Species

Greater 80% Removed Species

Greater 75% Removed Species

So with the low-orthogroup species removed, this tree is looking way better (from my bioinformatics point of view; biologically I have no idea, sorry).

  • The support for the split between the outgroup and everything else is pretty low, which is interesting.
  • As we increase the number of orthogroups (by lowering the >x% species threshold), the tree seems to hold the same shape and some of the low bootstraps bump up.
  • Actually, greater 85% seems to give the highest bootstrap support for the trees.
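
One way to put a number on the comparison in these bullets is to summarise the internal-node bootstrap values of each tree. This is a minimal base R sketch, assuming the supports have already been extracted from each tree as numeric vectors. `summarise_support()`, the cut-off of 70 for "strong" support, and the values in `support` are all made up for illustration and are not results from these runs.

```r
# Hypothetical helper: summarise bootstrap support for one tree.
# `strong` is an assumed cut-off, not one used in the original analysis.
summarise_support <- function(bootstrap, strong = 70) {
  bootstrap <- bootstrap[!is.na(bootstrap)]   # drop nodes without a value
  c(
    n_nodes     = length(bootstrap),          # number of supported nodes
    median      = median(bootstrap),          # typical support
    prop_strong = mean(bootstrap >= strong)   # share of nodes at/above cut-off
  )
}

# Invented example values, NOT taken from the RAxML output.
support <- list(
  g90 = c(100, 95, 40, 60),
  g85 = c(100, 98, 75, 80)
)

# One row per tree, one column per summary statistic.
support_summary <- t(sapply(support, summarise_support))
print(support_summary)
```

With real data, each vector would be the numeric node labels of one tree, so the thresholds can be compared side by side rather than by eye.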

BUSCO QC of Re-made Genomes

Okay so I have run trimming, the prokaryotic and viral contamination filtering (with bit scores 150 and 200), and also removed contigs < 500 bp. I ran BUSCO on both the 150 and 200 versions with the fungi_odb12 and ascomycota_odb12 databases.

Things to note:

1) The 150 versus 200 filter does not really impact scores at all. Therefore, let's go with the 200 filter.
2) These were never going to be high because of the DNA extraction and amplification you did. The fact that they are all similar and around 10% is good.
3) A few things fail/do not assemble, so we can remove these. Note these are also the ones removed from the original analysis due to them having too few orthogroups.

Also not included are the ones which did not pass the 500 bp cleaning:
- tutex -> this had contigs only <500, so there is nothing in the cleaned and sorted fasta file.
- ag340 -> this had contigs only <500, so there is nothing in the cleaned and sorted fasta file.
- jt4453 -> this had contigs only <500, so there is nothing in the cleaned and sorted fasta file.
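
To illustrate why these three samples end up with an empty FASTA, here is a small base R sketch of a minimum-length filter. The tool actually used for the cleaning is not stated here, so `filter_contigs()` and the toy sequences are hypothetical stand-ins for the idea, not the real pipeline.

```r
# Hypothetical helper: keep contigs at or above a minimum length (in bp).
filter_contigs <- function(contigs, min_len = 500) {
  contigs[nchar(contigs) >= min_len]
}

# Toy assembly as a named character vector (invented sequences).
toy <- c(
  contig_1 = strrep("ACGT", 200),  # 800 bp, kept
  contig_2 = strrep("A", 499),     # 499 bp, dropped
  contig_3 = strrep("GC", 250)     # 500 bp, kept
)

kept <- filter_contigs(toy)
names(kept)

# A sample whose contigs are all < 500 bp leaves nothing behind,
# which is the situation described for tutex, ag340 and jt4453.
n_left <- length(filter_contigs(c(only_short = strrep("T", 120))))
n_left
```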

Re-done Gene predictions RAXML

We have a total of 84 species that worked through gene prediction AND (even with ones with <1000 genes) we still got really good SCOs between all species.

I did a preliminary analysis, and the three samples (rh722, rh1006, ag335) made a weird group so we have removed them. Also rooted on three species and this is what it gave.

All SCO with Three Root

So when using the two species Arthur wanted for rooting it was a little weird (i.e. it can't root properly, as there is another species in there messing it up). The tree above shows there are three Tuber spinoreticulatum, so maybe using all three will fix the rooting problem. (Spoiler: it does.) I removed those old weird trees, so now it is just the good ones rooted with the three species.

Number of species: 80
Number of SCO: 69 (niiiiiice)
Concatenation length: 42,911 bp

Greater 95% Orthologs

Number of species: 80
Number of SCO: 310
Concatenation length: 202,736 bp

Greater 90% Orthologs

Number of species: 80
Number of SCO: 370
Concatenation length: 239,864 bp

Greater 85% Orthologs

Number of species: 80
Number of SCO: 387
Concatenation length: 250,299 bp

Ben To Do

So I think some of the following steps should be done.

Questions and Answers from Arthur - 22 July 2025

  1. What method was used for library prep and sequencing?
  1. What was the original assembly command with SPAdes?
  1. Where are the raw reads?
  1. What did you sequence: pure cultures, dirty samples, etc.?
  1. If dirty samples, did you do a prokaryotic/viral/metazoan cleaning before assembly?
  1. From the data we have, there are species with really no orthologs (see Excel).
  1. If I make a spreadsheet with all the names (i.e. ag213, jf1136 etc.) can you fill in the column information for me?
  1. Things to remove (i.e. rerun everything from protortho onwards with these removed).
  1. Which species will the tree be rooted on?
---
title: "R Notebook"
output: html_notebook
---

Okay so we have the three trees, rooted on **rh1988** and **ag342**.  

- Orthologs > 80% species -> concatenation length = 23,805 bp.  
- Orthologs > 75% species -> concatenation length = 80,289 bp.  
- Orthologs > 70% species -> concatenation length = 130,254 bp.  

Things to note here are that the following species get pretty much 0 orthologs. I think a run without them is needed. This will increase the orthologs identified considerably. 

- ag335.names_modified.fas.  
- ag340.names_modified.fas.  
- ag352.names_modified.fas.  
- bc64.names_modified.fas.  
- jt11146.names_modified.fas.  
- tr117.names_modified.fas.  
- tr64.names_modified.fas.  
- tutex.names_modified.fas.  
- zb1227.names_modified.fas.  
- zb1644.names_modified.fas.  
- zb3441.names_modified.fas.  
- zb3935.names_modified.fas.  

And here are the SCO counts at each percentage-of-species threshold.  

90% - greater 72 species, SCOs is 0.  
85% - greater 68 species, SCOs is 2.  
80% - greater 64 species, SCOs is 52.  
75% - greater 60 species, SCOs is 164.  
70% - greater 56 species, SCOs is 255.  
65% - greater 52 species, SCOs is 308.  
60% - greater 48 species, SCOs is 355.  
55% - greater 44 species, SCOs is 419.  
50% - greater 40 species, SCOs is 488.  
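
The species counts in the list above look like the percentage of the total species number, rounded to the nearest whole species. Here is a quick sketch of that arithmetic, assuming 80 species in this run (inferred from 90% = 72) and 68 in the reduced run described further down. `min_species()` is a hypothetical helper, and the rounding rule is my reading of the numbers rather than something taken from the pipeline.

```r
# Hypothetical helper: species count that corresponds to a percentage threshold.
# Assumes simple rounding to the nearest whole species.
min_species <- function(n_species, pct) {
  round(n_species * pct / 100)
}

pct <- seq(90, 50, by = -5)

thresholds <- data.frame(
  pct        = pct,
  all_80     = min_species(80, pct),  # all samples (assumed 80 species)
  removed_68 = min_species(68, pct)   # low-orthogroup species removed
)
print(thresholds)
```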

```{r library loading, include = F}
library(tidyverse)
library(ggtree)
library(treeio)
library(tidytree)
sessionInfo()
```

```{r making collapse work, include = F}
nodeid.tbl_tree <- utils::getFromNamespace("nodeid.tbl_tree", "tidytree")
rootnode.tbl_tree <- utils::getFromNamespace("rootnode.tbl_tree", "tidytree")
offspring.tbl_tree <- utils::getFromNamespace("offspring.tbl_tree", "tidytree")
offspring.tbl_tree_item <- utils::getFromNamespace(".offspring.tbl_tree_item", "tidytree")
child.tbl_tree <- utils::getFromNamespace("child.tbl_tree", "tidytree")
parent.tbl_tree <- utils::getFromNamespace("parent.tbl_tree", "tidytree")
```

## All Samples 
### Greater 80% Orthologs

```{r importing g80 tree, include = F}
read.newick(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/raxml_g80/RAxML_bipartitions.tubme_scogreat80_100bs") -> g80_data
g80_data$edge.length <- g80_data$edge.length * 70
```

```{r g80 making the node shapes and colours, include = F}
tree_df_80 <- fortify(g80_data)

tree_df_80 <- tree_df_80 %>%
  filter(!isTip & !is.na(label)) %>%
  mutate(bootstrap = as.numeric(label)) %>%
  filter(bootstrap > 0) %>% 
  mutate(
    boot_bin = case_when(
      bootstrap >= 91 ~ "91–100",
      bootstrap >= 81 ~ "81–90",
      bootstrap >= 71 ~ "71–80",
      bootstrap >= 61 ~ "61–70",
      bootstrap >= 51 ~ "51–60",
      bootstrap >= 41 ~ "41–50",
      bootstrap >= 31 ~ "31–40",
      bootstrap >= 21 ~ "21–30",
      bootstrap >= 11 ~ "11–20",
      TRUE ~ "0–10"
    )
  )
# View(tree_df_80)

bin_colors <- c(
  "91–100" = "grey60",
  "81–90" = "darkblue",
  "71–80" = "blue",
  "61–70" = "dodgerblue3",
  "51–60" = "skyblue3",
  "41–50" = "chartreuse4",
  "31–40" = "goldenrod2",
  "21–30" = "orange",
  "11–20" = "orangered2",
  "0–10"   = "red"
)
```
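
The same ten-level binning is repeated for every tree in this notebook, so it could be pulled into one small function. Below is a sketch in base R. `bin_bootstrap()` is a hypothetical helper (it does not exist in the notebook), written so that its labels match the names in `bin_colors`. It gives the same bins as the `case_when()` above for any non-missing numeric value. Missing values come back as `NA` rather than "0–10", which should not matter here because they are filtered out beforehand.

```r
# Hypothetical helper: map bootstrap values to the ten bin labels used above.
bin_bootstrap <- function(bootstrap) {
  lower  <- c(0, seq(11, 91, by = 10))      # lower edge of each bin
  upper  <- seq(10, 100, by = 10)           # upper edge of each bin
  labels <- paste0(lower, "\u2013", upper)  # en dash, as in bin_colors
  # findInterval() counts how many of the edges 11, 21, ..., 91 are <= value,
  # which is the zero-based position of the bin.
  labels[findInterval(bootstrap, lower[-1]) + 1]
}

binned <- bin_bootstrap(c(5, 10, 11, 50, 90, 91, 100))
print(binned)
```

Inside the pipe it would then be used as `mutate(boot_bin = bin_bootstrap(bootstrap))`, assuming `bootstrap` has already been converted to numeric.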

```{r plotting the g80 tree, fig.width = 15, fig.height = 15, echo = F, results='hide'}
ggtree(g80_data, ladderize = T, size = 0.5) +
  geom_tiplab(size = 3,
              align = F,
              family = "mono") +
  xlim(-1, 15) +
  geom_rootedge(rootedge = 1) +
  # geom_nodelab(aes(x = branch, label = label),
  #              vjust = -.5,
  #              size = 3) +
  geom_point(
    data = tree_df_80,
    aes(x = x, y = y, fill = boot_bin),
    shape = 21,
    size = 2.5,
    stroke = 0.2,
    color = "black"  # optional border for visibility
  ) +
  scale_fill_manual(values = bin_colors, 
                    name = "Bootstrap %")
```

So orthologs > 80% did not work very well. In the RAxML output it consistently could not resolve the relationship between the outgroups (as can be seen from the chaos at the bottom of the plot). 


### Greater 75% Orthologs

```{r importing g75 tree, include = F}
read.newick(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/raxml_g75/RAxML_bipartitions.tubme_scogreatg75_100bs") -> g75_data
g75_data$edge.length <- g75_data$edge.length * 70
```

```{r g75 making the node shapes and colours, include = F}
tree_df_75 <- fortify(g75_data)

tree_df_75 <- tree_df_75 %>%
  filter(!isTip & !is.na(label)) %>%
  mutate(bootstrap = as.numeric(label)) %>%
  filter(bootstrap > 0) %>% 
  mutate(
    boot_bin = case_when(
      bootstrap >= 91 ~ "91–100",
      bootstrap >= 81 ~ "81–90",
      bootstrap >= 71 ~ "71–80",
      bootstrap >= 61 ~ "61–70",
      bootstrap >= 51 ~ "51–60",
      bootstrap >= 41 ~ "41–50",
      bootstrap >= 31 ~ "31–40",
      bootstrap >= 21 ~ "21–30",
      bootstrap >= 11 ~ "11–20",
      TRUE ~ "0–10"
    )
  )
# View(tree_df_75)

bin_colors <- c(
  "91–100" = "grey60",
  "81–90" = "darkblue",
  "71–80" = "blue",
  "61–70" = "dodgerblue3",
  "51–60" = "skyblue3",
  "41–50" = "chartreuse4",
  "31–40" = "goldenrod2",
  "21–30" = "orange",
  "11–20" = "orangered2",
  "0–10"   = "red"
)
```

```{r plotting the g75 tree, fig.width = 15, fig.height = 15, echo = F}
ggtree(g75_data, ladderize = T, size = 0.5) +
  geom_tiplab(size = 3,
              align = F,
              family = "mono") +
  xlim(-1, 13) +
  geom_rootedge(rootedge = 1) +
  # geom_nodelab(aes(x = branch, label = label),
  #              vjust = -.5,
  #              size = 3) +
  geom_point(
    data = tree_df_75,
    aes(x = x, y = y, fill = boot_bin),
    shape = 21,
    size = 2.5,
    stroke = 0.2,
    color = "black"  # optional border for visibility
  ) +
  scale_fill_manual(values = bin_colors, 
                    name = "Bootstrap %")
```

### Greater 70% Orthologs

```{r importing g70 tree, include = F}
read.newick(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/raxml_g70/RAxML_bipartitions.tubme_scogreatg70_100bs") -> g70_data
g70_data$edge.length <- g70_data$edge.length * 70
```

```{r g70 making the node shapes and colours, include = F}
tree_df_70 <- fortify(g70_data)

tree_df_70 <- tree_df_70 %>%
  filter(!isTip & !is.na(label)) %>%
  mutate(bootstrap = as.numeric(label)) %>%
  filter(bootstrap > 0) %>% 
  mutate(
    boot_bin = case_when(
      bootstrap >= 91 ~ "91–100",
      bootstrap >= 81 ~ "81–90",
      bootstrap >= 71 ~ "71–80",
      bootstrap >= 61 ~ "61–70",
      bootstrap >= 51 ~ "51–60",
      bootstrap >= 41 ~ "41–50",
      bootstrap >= 31 ~ "31–40",
      bootstrap >= 21 ~ "21–30",
      bootstrap >= 11 ~ "11–20",
      TRUE ~ "0–10"
    )
  )
# View(tree_df_70)

bin_colors <- c(
  "91–100" = "grey60",
  "81–90" = "darkblue",
  "71–80" = "blue",
  "61–70" = "dodgerblue3",
  "51–60" = "skyblue3",
  "41–50" = "chartreuse4",
  "31–40" = "goldenrod2",
  "21–30" = "orange",
  "11–20" = "orangered2",
  "0–10"   = "red"
)
```

```{r plotting the g70 tree, fig.width = 15, fig.height = 15, echo = F}
ggtree(g70_data, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(size = 3, 
              align = F, 
             family = "mono") +
  xlim(-1, 12) +
  geom_rootedge(rootedge = 1) +
  # geom_nodelab(aes(x=branch, label = label), 
  #              vjust=-.5, size=3)
    geom_point(
    data = tree_df_70,
    aes(x = x, y = y, fill = boot_bin),
    shape = 21,
    size = 2.5,
    stroke = 0.2,
    color = "black"  # optional border for visibility
  ) +
  scale_fill_manual(values = bin_colors, 
                    name = "Bootstrap %")
```


## Removing the Low Orthogroup Species 

So this is with the following species removed due to having pretty much no orthogroups.  

- ag335.names_modified.fas.  
- ag340.names_modified.fas.  
- ag352.names_modified.fas.  
- bc64.names_modified.fas.  
- jt11146.names_modified.fas.  
- tr117.names_modified.fas.  
- tr64.names_modified.fas.  
- tutex.names_modified.fas.  
- zb1227.names_modified.fas.  
- zb1644.names_modified.fas.  
- zb3441.names_modified.fas.  
- zb3935.names_modified.fas. 

The number of orthogroups identified in this analysis is much better (scroll to the top to compare with the other analysis).  

100% SCO is 1.  
90% - greater 61 species, SCOs is 143.  
85% - greater 58 species, SCOs is 227.  
80% - greater 54 species, SCOs is 285.  
75% - greater 51 species, SCOs is 314.  
70% - greater 48 species, SCOs is 347.  
65% - greater 44 species, SCOs is 416.  
60% - greater 41 species, SCOs is 455.  
55% - greater 37 species, SCOs is 560.  
50% - greater 34 species, SCOs is 622.  

I ran this for 90%, 85%, 80%, and 75% and rooted on 

- 143 Orthologs > 90% species -> concatenation length = 71,383 bp
- 227 Orthologs > 85% species -> concatenation length = 120,775 bp
- 285 Orthologs > 80% species -> concatenation length = 155,666 bp
- 314 Orthologs > 75% species -> concatenation length = 169,476 bp

```{r tree metadata for all, include = F}
read.csv(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/tree_metadata.csv") %>% 
  mutate(across(where(is.character), ~na_if(., ""))) -> tree_md
# View(tree_md)
```


### Greater 90% Removed Species

```{r importing g90rem tree, include = F}
read.newick(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/raxml_g90rem/RAxML_bipartitions.tubme_scogreat90rem_100bs") -> g90rem_data
g90rem_data$edge.length <- g90rem_data$edge.length * 120
```

So the tree file has 68 species and the metadata has 94 species. 

```{r annotating tree data g90rem, include = F}
## getting the correct 68 samples from metadata
g90rem_data %>% 
  as_tibble() %>% 
  filter(str_detect(label, "^[A-Za-z]")) %>% 
  dplyr::select(label) -> samples_g90

## combining metadata with tree object. 
g90rem_data %>% 
  as_tibble() %>% 
  full_join(tree_md %>% 
              dplyr::filter(sample %in% samples_g90$label), 
            join_by(label == sample)) %>% 
  as.treedata() -> g90rem_data
```
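
Because the tree and the metadata do not contain the same set of samples, it can be worth checking the overlap before joining. This is a small base R sketch with toy stand-ins: the sample names are borrowed from this notebook, but the table contents, `toy_md`, and the species labels are invented for illustration.

```r
# Toy stand-ins (invented): tip labels from a tree and a metadata table.
tip_labels <- c("ag342", "rh1988", "jt13224")
toy_md <- data.frame(
  sample        = c("ag342", "rh1988", "zb1227", "tutex"),
  Genus_species = c("sp. A", "sp. B", "sp. C", "sp. D")
)

# Metadata rows that match a tip in the tree (same idea as the filter() above).
md_in_tree <- toy_md[toy_md$sample %in% tip_labels, ]

# Tips with no metadata row; after the join these would carry NA in the
# metadata columns, so a tip label drawn from Genus_species would be missing.
tips_missing_md <- setdiff(tip_labels, toy_md$sample)

nrow(md_in_tree)
tips_missing_md
```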

```{r g90rem making the node shapes and colours, include = F}
tree_df_g90rem <- fortify(g90rem_data)

tree_df_g90rem <- tree_df_g90rem %>%
  filter(!isTip & !is.na(label)) %>%
  mutate(bootstrap = as.numeric(label)) %>%
  filter(bootstrap > 0) %>% 
  mutate(
    boot_bin = case_when(
      bootstrap >= 91 ~ "91–100",
      bootstrap >= 81 ~ "81–90",
      bootstrap >= 71 ~ "71–80",
      bootstrap >= 61 ~ "61–70",
      bootstrap >= 51 ~ "51–60",
      bootstrap >= 41 ~ "41–50",
      bootstrap >= 31 ~ "31–40",
      bootstrap >= 21 ~ "21–30",
      bootstrap >= 11 ~ "11–20",
      TRUE ~ "0–10"
    )
  )
# View(tree_df_g90rem)

bin_colors <- c(
  "91–100" = "grey60",
  "81–90" = "darkblue",
  "71–80" = "blue",
  "61–70" = "dodgerblue3",
  "51–60" = "skyblue3",
  "41–50" = "chartreuse4",
  "31–40" = "goldenrod2",
  "21–30" = "orange",
  "11–20" = "orangered2",
  "0–10"   = "red"
)
```

```{r getting nodelabs for better plot, fig.width = 15, fig.height = 15}
ggtree(g90rem_data, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = label, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 14) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)
```

```{r making samples bold that are in weird places g90rem, include = F}
highlight_samples <- c("jt13224", "vk4538", "am1126", "mes1418")

g90rem_data %>% 
  as_tibble() %>% 
  mutate(highlight = label %in% highlight_samples) %>% 
  as.treedata() -> g90rem_data
```

```{r plotting the g90rem tree, fig.width = 15, fig.height = 15, echo = F, results='hide'}
ggtree(g90rem_data, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = Genus_species, 
                  fontface = ifelse(highlight, "bold.italic", "italic")),
              size = 4,
              align = F,
              family = "times") +
  xlim(-1, 15) +
  geom_rootedge(rootedge = 1) +
  # geom_nodelab(aes(x = branch, label = label),
  #              vjust = -.5,
  #              size = 3) +
  geom_point(
    data = tree_df_g90rem,
    aes(x = x, y = y, fill = boot_bin),
    shape = 21,
    size = 2.5,
    stroke = 0.2,
    color = "black"  # optional border for visibility
  ) +
  scale_fill_manual(values = bin_colors, 
                    name = "Bootstrap %") +
  ## Europe N105 Colouring
  geom_cladelab(
    node = 105,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 1,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 105,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) +
  ## Asia N96 Colouring
  geom_cladelab(
    node = 96,
    label = "Asia",
    fontsize = 6,
    align = F,
    offset = 1.5,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 96,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) +
  ## Europe N103 Colouring
  geom_cladelab(
    node = 103,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 1.1,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 103,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) +
  ## North America N119 Colouring
  geom_cladelab(
    node = 119,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 1.9,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 119,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) + 
    ## North America N72 Colouring
  geom_cladelab(
    node = 72,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 1.1,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 72,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  )
```


### Greater 85% Removed Species

```{r importing g85rem tree, include = F}
read.newick(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/raxml_g85rem/RAxML_bipartitions.tubme_scogreat85rem_100bs") -> g85rem_data
g85rem_data$edge.length <- g85rem_data$edge.length * 120
```

```{r annotating tree data g85rem, include = F}
## getting the correct 68 samples from metadata
g85rem_data %>% 
  as_tibble() %>% 
  filter(str_detect(label, "^[A-Za-z]")) %>% 
  dplyr::select(label) -> samples_g85

## combining metadata with tree object. 
g85rem_data %>% 
  as_tibble() %>% 
  full_join(tree_md %>% 
              dplyr::filter(sample %in% samples_g85$label), 
            join_by(label == sample)) %>% 
  as.treedata() -> g85rem_data
```

```{r g85rem making the node shapes and colours, include = F}
tree_df_g85rem <- fortify(g85rem_data)

tree_df_g85rem <- tree_df_g85rem %>%
  filter(!isTip & !is.na(label)) %>%
  mutate(bootstrap = as.numeric(label)) %>%
  filter(bootstrap > 0) %>% 
  mutate(
    boot_bin = case_when(
      bootstrap >= 91 ~ "91–100",
      bootstrap >= 81 ~ "81–90",
      bootstrap >= 71 ~ "71–80",
      bootstrap >= 61 ~ "61–70",
      bootstrap >= 51 ~ "51–60",
      bootstrap >= 41 ~ "41–50",
      bootstrap >= 31 ~ "31–40",
      bootstrap >= 21 ~ "21–30",
      bootstrap >= 11 ~ "11–20",
      TRUE ~ "0–10"
    )
  )
# View(tree_df_g85rem)

bin_colors <- c(
  "91–100" = "grey60",
  "81–90" = "darkblue",
  "71–80" = "blue",
  "61–70" = "dodgerblue3",
  "51–60" = "skyblue3",
  "41–50" = "chartreuse4",
  "31–40" = "goldenrod2",
  "21–30" = "orange",
  "11–20" = "orangered2",
  "0–10"   = "red"
)
```

```{r getting nodelabs for better plot g85, fig.width = 15, fig.height = 15, include = F}
ggtree(g85rem_data, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = Continent, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 14) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)
```

```{r making samples bold that are in weird places g85rem, include = F}
highlight_samples <- c("jt13224", "vk4538", "am1126", "mes1418", "jt36268")

g85rem_data %>% 
  as_tibble() %>% 
  mutate(highlight = label %in% highlight_samples) %>% 
  as.treedata() -> g85rem_data
```

```{r plotting the g85rem tree, fig.width = 15, fig.height = 15, echo = F, results='hide'}
ggtree(g85rem_data, ladderize = T, size = 0.5) +
    geom_tiplab(aes(label = Genus_species, 
                  fontface = ifelse(highlight, "bold.italic", "italic")),
              size = 4,
              align = F,
              family = "times") +
  xlim(-1, 16) +
  geom_rootedge(rootedge = 1) +
  # geom_nodelab(aes(x = branch, label = label),
  #              vjust = -.5,
  #              size = 3) +
  geom_point(
    data = tree_df_g85rem,
    aes(x = x, y = y, fill = boot_bin),
    shape = 21,
    size = 2.5,
    stroke = 0.2,
    color = "black"  # optional border for visibility
  ) +
  scale_fill_manual(values = bin_colors, 
                    name = "Bootstrap %") +
  ## Europe N111 Colouring
  geom_cladelab(
    node = 111,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 1.2,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 111,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) +
  ## Asia N126 Colouring
  geom_cladelab(
    node = 126,
    label = "Asia",
    fontsize = 6,
    align = F,
    offset = 1.7,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 126,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) +
  ## Europe N133 Colouring
  geom_cladelab(
    node = 133,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 1.3,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 133,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) +
  ## North America N92 Colouring
  geom_cladelab(
    node = 92,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 2.1,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 92,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) + 
  ## North America N72 Colouring
  geom_cladelab(
    node = 72,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 1.2,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 72,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  )
```


### Greater 80% Removed Species

```{r importing g80rem tree, include = F}
read.newick(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/raxml_g80rem/RAxML_bipartitions.tubme_scogreat80rem_100bs") -> g80rem_data
g80rem_data$edge.length <- g80rem_data$edge.length * 120
```

```{r annotating tree data g80rem, include = F}
## getting the correct 68 samples from metadata
g80rem_data %>% 
  as_tibble() %>% 
  filter(str_detect(label, "^[A-Za-z]")) %>% 
  dplyr::select(label) -> samples_g80

## combining metadata with tree object. 
g80rem_data %>% 
  as_tibble() %>% 
  full_join(tree_md %>% 
              dplyr::filter(sample %in% samples_g80$label), 
            join_by(label == sample)) %>% 
  as.treedata() -> g80rem_data
```

```{r g80rem making the node shapes and colours, include = F}
tree_df_g80rem <- fortify(g80rem_data)

tree_df_g80rem <- tree_df_g80rem %>%
  filter(!isTip & !is.na(label)) %>%
  mutate(bootstrap = as.numeric(label)) %>%
  filter(bootstrap > 0) %>% 
  mutate(
    boot_bin = case_when(
      bootstrap >= 91 ~ "91–100",
      bootstrap >= 81 ~ "81–90",
      bootstrap >= 71 ~ "71–80",
      bootstrap >= 61 ~ "61–70",
      bootstrap >= 51 ~ "51–60",
      bootstrap >= 41 ~ "41–50",
      bootstrap >= 31 ~ "31–40",
      bootstrap >= 21 ~ "21–30",
      bootstrap >= 11 ~ "11–20",
      TRUE ~ "0–10"
    )
  )
# View(tree_df_g80rem)

bin_colors <- c(
  "91–100" = "grey60",
  "81–90" = "darkblue",
  "71–80" = "blue",
  "61–70" = "dodgerblue3",
  "51–60" = "skyblue3",
  "41–50" = "chartreuse4",
  "31–40" = "goldenrod2",
  "21–30" = "orange",
  "11–20" = "orangered2",
  "0–10"   = "red"
)
```

```{r getting nodelabs for better plot g80, fig.width = 15, fig.height = 15, include = F}
ggtree(g80rem_data, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = Location, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 14) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)
```

```{r making samples bold that are in weird places g80rem, include = F}
highlight_samples <- c("jt13224", "am1126", "mes1418", "jt36268")

g80rem_data %>% 
  as_tibble() %>% 
  mutate(highlight = label %in% highlight_samples) %>% 
  as.treedata() -> g80rem_data
```

```{r plotting the g80rem tree, fig.width = 15, fig.height = 15, echo = F, results='hide'}
ggtree(g80rem_data, ladderize = T, size = 0.5) +
    geom_tiplab(aes(label = Genus_species, 
                  fontface = ifelse(highlight, "bold.italic", "italic")),
              size = 4,
              align = F,
              family = "times") +
  xlim(-1, 16) +
  geom_rootedge(rootedge = 1) +
  # geom_nodelab(aes(x = branch, label = label),
  #              vjust = -.5,
  #              size = 3) +
  geom_point(
    data = tree_df_g80rem,
    aes(x = x, y = y, fill = boot_bin),
    shape = 21,
    size = 2.5,
    stroke = 0.2,
    color = "black"  # optional border for visibility
  ) +
  scale_fill_manual(values = bin_colors, 
                    name = "Bootstrap %") +
  ## Europe N86 Colouring
  geom_cladelab(
    node = 86,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 1.2,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 86,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) +
  ## Asia N79 Colouring
  geom_cladelab(
    node = 79,
    label = "Asia",
    fontsize = 6,
    align = F,
    offset = 1.7,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 79,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) +
  ## Europe N76 Colouring
  geom_cladelab(
    node = 76,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 1.3,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 76,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) +
  ## North America N100 Colouring
  geom_cladelab(
    node = 100,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 2.1,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 100,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) + 
  ## North America N116 Colouring
  geom_cladelab(
    node = 116,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 1.2,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 116,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  )
```


### Greater 75% Removed Species

```{r importing g75rem tree, include = F}
read.newick(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/raxml_g75rem/RAxML_bipartitions.tubme_scogreat75rem_100bs") -> g75rem_data
g75rem_data$edge.length <- g75rem_data$edge.length * 120
```

```{r annotating tree data g75rem, include = F}
## getting the correct 68 samples from metadata 
g75rem_data %>% 
  as_tibble() %>% 
  filter(str_detect(label, "^[A-Za-z]")) %>% 
  dplyr::select(label) -> samples_g75

## combining metadata with tree object. 
g75rem_data %>% 
  as_tibble() %>% 
  full_join(tree_md %>% 
              dplyr::filter(sample %in% samples_g75$label), 
            join_by(label == sample)) %>% 
  as.treedata() -> g75rem_data
```

```{r g75rem making the node shapes and colours, include = F}
tree_df_g75rem <- fortify(g75rem_data)

tree_df_g75rem <- tree_df_g75rem %>%
  filter(!isTip & !is.na(label)) %>%
  mutate(bootstrap = as.numeric(label)) %>%
  filter(bootstrap > 0) %>% 
  mutate(
    boot_bin = case_when(
      bootstrap >= 91 ~ "91–100",
      bootstrap >= 81 ~ "81–90",
      bootstrap >= 71 ~ "71–80",
      bootstrap >= 61 ~ "61–70",
      bootstrap >= 51 ~ "51–60",
      bootstrap >= 41 ~ "41–50",
      bootstrap >= 31 ~ "31–40",
      bootstrap >= 21 ~ "21–30",
      bootstrap >= 11 ~ "11–20",
      TRUE ~ "0–10"
    )
  )
# View(tree_df_g75rem)

bin_colors <- c(
  "91–100" = "grey60",
  "81–90" = "darkblue",
  "71–80" = "blue",
  "61–70" = "dodgerblue3",
  "51–60" = "skyblue3",
  "41–50" = "chartreuse4",
  "31–40" = "goldenrod2",
  "21–30" = "orange",
  "11–20" = "orangered2",
  "0–10"   = "red"
)
```

```{r getting nodelabs for better plot g75, fig.width = 15, fig.height = 15, include = F}
ggtree(g75rem_data, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = Continent, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 14) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)
```

```{r making samples bold that are in weird places g75rem, include = F}
highlight_samples <- c("jt13224", "am1126", "mes1418", "jt36268", "flas61961")

g75rem_data %>% 
  as_tibble() %>% 
  mutate(highlight = label %in% highlight_samples) %>% 
  as.treedata() -> g75rem_data
```

```{r plotting the g75rem tree, fig.width = 15, fig.height = 15, echo = F, results='hide'}
ggtree(g75rem_data, ladderize = T, size = 0.5) +
    geom_tiplab(aes(label = Genus_species, 
                  fontface = ifelse(highlight, "bold.italic", "italic")),
              size = 4,
              align = F,
              family = "times") +
  xlim(-1, 16) +
  geom_rootedge(rootedge = 1) +
  # geom_nodelab(aes(x = branch, label = label),
  #              vjust = -.5,
  #              size = 3) +
  geom_point(
    data = tree_df_g75rem,
    aes(x = x, y = y, fill = boot_bin),
    shape = 21,
    size = 2.5,
    stroke = 0.2,
    color = "black"  # optional border for visibility
  ) +
  scale_fill_manual(values = bin_colors, 
                    name = "Bootstrap %") +
  ## Europe N104 Colouring
  geom_cladelab(
    node = 104,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 1.2,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 104,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) +
  ## Asia N97 Colouring
  geom_cladelab(
    node = 97,
    label = "Asia",
    fontsize = 6,
    align = F,
    offset = 1.7,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 97,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) +
  ## Europe N118 Colouring
  geom_cladelab(
    node = 118,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 1.3,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 118,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) +
  ## North America N119 Colouring
  geom_cladelab(
    node = 119,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 2.1,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 119,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  ) + 
  ## North America N72 Colouring
  geom_cladelab(
    node = 72,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 1.2,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 72,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3, 
    to.bottom = T
  )
```


So with the low orthogroup species removed, this tree is looking way better (from my bioinformatic point of view; biologically I have no idea, sorrrrrry lol).  

- The support for the outgroup and everything else is pretty low, which is interesting.  
- As we increase orthogroups (by lowering the >x% species threshold), the tree seems to hold the same shape and some of the low bootstraps bump up.  
- Actually, greater 85% seems to give the highest bootstrap support for the trees.  


## BUSCO QC of Re-made Genomes 

Okay so I have run trimming, the prokaryotic and viral contamination screen (with bit score cutoffs of 150 and 200), and also removed contigs < 500 bp. I ran BUSCO on both the 150 and 200 outputs with the `fungi_odb12` and `ascomycota_odb12` databases. 

```{r reading in busco and making dfs, include = F}
read.csv(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/asm_pipeline/busco_summary.tsv", 
         sep = "\t") -> busco_res

busco_res %>% 
  dplyr::filter(Database %in% "fung") %>% 
  dplyr::mutate(Total = 1122) %>%
  dplyr::filter(!Measure %in% c("Complete", "Total")) %>% 
  dplyr::mutate(Percentage = (Value/Total)*100) -> busco_fung

busco_res %>% 
  dplyr::filter(Database %in% "asco") %>% 
  dplyr::mutate(Total = 2826) %>%
  dplyr::filter(!Measure %in% c("Complete", "Total")) %>% 
  dplyr::mutate(Percentage = (Value/Total)*100) -> busco_asco
```
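
The two blocks above differ only in the database total, so a lookup vector could do both in one pass. This is a sketch, not what was run: `add_busco_percentage()` is a hypothetical helper, the totals are the ones hard-coded above, and the `Measure` names in the toy data ("Single", "Missing") are assumptions about what the summary file contains.

```r
# Totals per database, copied from the chunk above
busco_totals <- c(fung = 1122, asco = 2826)

# Hypothetical helper: drop the summary rows and convert counts to percentages
add_busco_percentage <- function(df, totals = busco_totals) {
  keep <- !df$Measure %in% c("Complete", "Total")
  out <- df[keep, , drop = FALSE]
  # look up the right total for each row by database name
  out$Total <- unname(totals[as.character(out$Database)])
  out$Percentage <- out$Value / out$Total * 100
  out
}

# Toy data (made up) just to show the arithmetic
toy_busco <- data.frame(
  Sample   = c("toy1", "toy1", "toy1"),
  Database = c("fung", "fung", "asco"),
  Measure  = c("Single", "Total", "Missing"),
  Value    = c(561, 1122, 1413)
)
toy_pct <- add_busco_percentage(toy_busco)
```

The result could then be split by `Database` for the two plots, or plotted in one go with `facet_wrap(~ Database)`.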

```{r plotting fungi BUSCO DNA, fig.width=8, fig.height=20, echo = F, results='hide'}
ggplot(data = busco_fung, 
       aes(y = Percentage, 
           x = Sample, 
           fill = Measure, 
           label = Percentage)) +
  geom_bar(stat = "identity", 
           position = position_stack(reverse = T)) +
  theme_bw() + 
  coord_flip() +
  scale_x_discrete(expand = c(0,0)) + 
  scale_y_continuous(labels = c("0%", "20%", "40%", "60%", "80%", "100%"), 
                     breaks = c(0, 20, 40, 60, 80, 100), 
                     expand = c(0,0)) +
  scale_fill_manual(name = "busco_measure", 
                    values = c("steelblue1", "steelblue4", "yellow3", "red3")) + 
  theme(text = element_text(size=10, family = "Arial")) + 
  ggtitle("BUSCO Results - Fungi_odb12 Database")
```

```{r plotting asco BUSCO DNA, fig.width=8, fig.height=20, echo = F, results='hide'}
ggplot(data = busco_asco, 
       aes(y = Percentage, 
           x = Sample, 
           fill = Measure, 
           label = Percentage)) +
  geom_bar(stat = "identity", 
           position = position_stack(reverse = T)) +
  theme_bw() + 
  coord_flip() +
  scale_x_discrete(expand = c(0,0)) + 
  scale_y_continuous(labels = c("0%", "20%", "40%", "60%", "80%", "100%"), 
                     breaks = c(0, 20, 40, 60, 80, 100), 
                     expand = c(0,0)) +
  scale_fill_manual(name = "busco_measure", 
                    values = c("steelblue1", "steelblue4", "yellow3", "red3")) + 
  theme(text = element_text(size=10, family = "Arial")) + 
  ggtitle("BUSCO Results - Ascomycota_odb12 Database")
```

Things to note:

1) The 150 versus 200 filter does not really impact scores at all. Therefore, let's go with the 200 filter. 
2) These were never going to be high because of the DNA extraction and amplification you did. The fact they are all similar and around 10% is good. 
3) A few things fail/do not assemble, so we can remove these. Note these are also the ones removed from the original analysis due to them having too few orthogroups.  

Also not included are the ones which did not pass the cleaning of 500bp   
- *tutex* -> this had contigs only <500, so there is nothing in the cleaned and sorted fasta file.  
- *ag340* -> this had contigs only <500, so there is nothing in the cleaned and sorted fasta file.  
- *jt4453* -> this had contigs only <500, so there is nothing in the cleaned and sorted fasta file.  
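
To make the 500 bp rule concrete, here is a small sketch of the length filter. It is not the tool that was used in the pipeline; `filter_contigs()` is a hypothetical helper working on a named character vector of sequences, and the toy contigs are made up. It shows why a sample with only short contigs ends up with an empty fasta.

```r
# Hypothetical helper: keep contigs of at least min_len bases
filter_contigs <- function(contigs, min_len = 500) {
  contigs[nchar(contigs) >= min_len]
}

# Toy assembly: 499 bp (dropped), 500 bp (kept), 1200 bp (kept)
toy_contigs <- c(
  c1 = strrep("A", 499),
  c2 = strrep("ACGT", 125),
  c3 = strrep("G", 1200)
)
toy_kept <- filter_contigs(toy_contigs)

# A sample with only short contigs leaves nothing behind
toy_empty <- filter_contigs(c(c1 = strrep("A", 120), c2 = strrep("C", 480)))
```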


## Re-done Gene predictions RAXML

We have a total of 84 species that worked through gene prediction AND (even with ones with <1000 genes) we still got really good SCOs between all species.  

I did a preliminary analysis, and the three samples (rh722, rh1006, ag335) made a weird group so we have removed them. Also rooted on three species and this is what it gave.

```{r getting new gp metadata, include = F}
read.csv(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/asm_pipeline/tree_metadata.csv") %>% 
  mutate(across(where(is.character), ~na_if(., ""))) %>% 
  mutate(Genus_species_good = str_replace_all(Genus_species, "_", " ")) %>% 
  mutate(tiplabel = paste0(sample, " | ", Genus_species_good, " | ", Location)) -> tree_md_gp
```


### All SCO with Three Root

So when using the two species Arthur wanted for rooting it was a little weird (i.e. it can't root cleanly as there is another species in there messing it up). The tree above shows there are three **Tuber spinoreticulatum** so maybe using all three will fix the rooting problem. (Spoiler, it does). I removed those old weird trees, so now it is just the good ones rooted with the three species. 

Number of species: 80  
Number of SCO: 69 (niiiiiice)  
Concatenation length: 42,911 bp  

```{r importing all three root tree, include = F}
read.newick(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/asm_pipeline/raxml_all/RAxML_bipartitions.tubme_scoall_100bs_3root") -> all_data_3
all_data_3$edge.length <- all_data_3$edge.length * 120
```

```{r annotating tree data all three root, include = F}
all_data_3 %>% 
  as_tibble() %>% 
  filter(str_detect(label, "^[A-Za-z]")) %>% 
  dplyr::select(label) -> samples_all

all_data_3 %>% 
  as_tibble() %>%
  full_join(tree_md_gp %>% 
              dplyr::filter(sample %in% samples_all$label), 
            join_by(label == sample)) %>%
  as.treedata() -> all_data_3
```

```{r all three root making the node shapes and colours, include = F}
all_data_df_3 <- fortify(all_data_3)

all_data_df_3 <- all_data_df_3 %>%
  filter(!isTip & !is.na(label)) %>% 
  mutate(bootstrap = as.numeric(label)) %>%
  filter(bootstrap > 0) %>% 
  mutate(
    boot_bin = case_when(
      bootstrap >= 91 ~ "91–100",
      bootstrap >= 81 ~ "81–90",
      bootstrap >= 71 ~ "71–80",
      bootstrap >= 61 ~ "61–70",
      bootstrap >= 51 ~ "51–60",
      bootstrap >= 41 ~ "41–50",
      bootstrap >= 31 ~ "31–40",
      bootstrap >= 21 ~ "21–30",
      bootstrap >= 11 ~ "11–20",
      TRUE ~ "0–10"
    )
  )
# View(all_data_df_3)

bin_colors <- c(
  "91–100" = "grey60",
  "81–90" = "darkblue",
  "71–80" = "blue",
  "61–70" = "dodgerblue3",
  "51–60" = "skyblue3",
  "41–50" = "chartreuse4",
  "31–40" = "goldenrod2",
  "21–30" = "orange",
  "11–20" = "orangered2",
  "0–10"   = "red"
)
```

```{r all three root getting nodelabs for better plot, fig.width = 15, fig.height = 15, include = F}
ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = Continent, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)

ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = label, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)

ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = Location, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)
```

```{r making samples bold that are in weird places all three root, include = F}
highlight_samples <- c("jt13224", "vk4538", "am1126", "jt36268", "tr64", "rh847", "mes1418")

all_data_3 %>%
  as_tibble() %>%
  mutate(highlight = label %in% highlight_samples) %>%
  as.treedata() -> all_data_3
```

```{r plotting the all three root tree, fig.width = 15, fig.height = 17, echo = F, results='hide'}
ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = tiplabel,
                  fontface = ifelse(highlight, "bold.italic", "italic")),
              size = 4,
              align = F,
              family = "times") +
  xlim(-1, 16) +
  geom_rootedge(rootedge = 1) +
  # geom_nodelab(aes(x = branch, label = label),
  #              vjust = -.5,
  #              size = 3) +
  geom_point(
    data = all_data_df_3,
    aes(x = x, y = y, fill = boot_bin),
    shape = 21,
    size = 2.5,
    stroke = 0.2,
    color = "black"  # optional border for visibility
  ) +
  scale_fill_manual(values = bin_colors, 
                    name = "Bootstrap %") +
  ## Europe N119 Colouring
  geom_cladelab(
    node = 119,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 3.3,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 119,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## Asia N112 Colouring
  geom_cladelab(
    node = 112,
    label = "Asia",
    fontsize = 6,
    align = F,
    offset = 3.7,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 112,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## North America N86 Colouring
  geom_cladelab(
    node = 86,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 4,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 86,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## North America N137 Colouring
  geom_cladelab(
    node = 137,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 4,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 137,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  )
```
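
Every tree repeats the same `geom_cladelab()` + `geom_hilight()` pair with only the node, label, fill and offset changing. A sketch of how that could be driven from a small table instead: `clade_spec` holds the values from the plot above, and `clade_layers()` is a hypothetical helper that returns a list of layers (adding a list of layers with `+` should work the same as adding them one by one). Only arguments already used in the chunk above are passed.

```r
# Clade annotations for the all-SCO three-root tree, copied from the plot above
clade_spec <- data.frame(
  node   = c(119, 112, 86, 137),
  label  = c("Europe", "Asia", "North America", "North America"),
  fill   = c("steelblue2", "grey60", "grey60", "steelblue2"),
  offset = c(3.3, 3.7, 4, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: two ggtree layers (label + highlight) per row of spec
clade_layers <- function(spec) {
  if (!requireNamespace("ggtree", quietly = TRUE)) {
    stop("ggtree is needed to build the clade layers")
  }
  layers <- lapply(seq_len(nrow(spec)), function(i) {
    list(
      ggtree::geom_cladelab(
        node = spec$node[i], label = spec$label[i],
        fontsize = 6, align = FALSE,
        offset = spec$offset[i], offset.text = 0.05,
        textcolor = "black", barcolor = "darkgrey", family = "times"
      ),
      ggtree::geom_hilight(
        node = spec$node[i], fill = spec$fill[i],
        type = "gradient", gradient.direction = "rt",
        alpha = 0.3, to.bottom = TRUE
      )
    )
  })
  # flatten one level so the result is a plain list of layers
  unlist(layers, recursive = FALSE)
}
```

Usage would be `ggtree(all_data_3, ladderize = T, size = 0.5) + clade_layers(clade_spec)`, with a new `clade_spec` per tree since the node numbers change between runs.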


### Greater 95% Orthologs

Number of species: 80  
Number of SCO: 310  
Concatenation length: 202,736 bp  

```{r importing g95 three root tree, include = F}
read.newick(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/asm_pipeline/raxml_g95/RAxML_bipartitions.tubme_scog95_100bs_3root") -> all_data_3
all_data_3$edge.length <- all_data_3$edge.length * 120
```

```{r annotating tree data g95 three root, include = F}
all_data_3 %>% 
  as_tibble() %>% 
  filter(str_detect(label, "^[A-Za-z]")) %>% 
  dplyr::select(label) -> samples_all

all_data_3 %>% 
  as_tibble() %>%
  full_join(tree_md_gp %>% 
              dplyr::filter(sample %in% samples_all$label), 
            join_by(label == sample)) %>%
  as.treedata() -> all_data_3
```

```{r g95 three root making the node shapes and colours, include = F}
all_data_df_3 <- fortify(all_data_3)

all_data_df_3 <- all_data_df_3 %>%
  filter(!isTip & !is.na(label)) %>% 
  mutate(bootstrap = as.numeric(label)) %>%
  filter(bootstrap > 0) %>% 
  mutate(
    boot_bin = case_when(
      bootstrap >= 91 ~ "91–100",
      bootstrap >= 81 ~ "81–90",
      bootstrap >= 71 ~ "71–80",
      bootstrap >= 61 ~ "61–70",
      bootstrap >= 51 ~ "51–60",
      bootstrap >= 41 ~ "41–50",
      bootstrap >= 31 ~ "31–40",
      bootstrap >= 21 ~ "21–30",
      bootstrap >= 11 ~ "11–20",
      TRUE ~ "0–10"
    )
  )
# View(all_data_df_3)

bin_colors <- c(
  "91–100" = "grey60",
  "81–90" = "darkblue",
  "71–80" = "blue",
  "61–70" = "dodgerblue3",
  "51–60" = "skyblue3",
  "41–50" = "chartreuse4",
  "31–40" = "goldenrod2",
  "21–30" = "orange",
  "11–20" = "orangered2",
  "0–10"   = "red"
)
```

```{r g95 three root getting nodelabs for better plot, fig.width = 15, fig.height = 15, include = F}
ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = Continent, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)

ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = label, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)

ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = Location, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)
```

```{r making samples bold that are in weird places g95 three root, include = F}
highlight_samples <- c("jt36268", "jt13224", "am1126", "rh847", "mes1418")

all_data_3 %>%
  as_tibble() %>%
  mutate(highlight = label %in% highlight_samples) %>%
  as.treedata() -> all_data_3
```

```{r plotting the g95 three root tree, fig.width = 15, fig.height = 17, echo = F, results='hide'}
ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = tiplabel,
                  fontface = ifelse(highlight, "bold.italic", "italic")),
              size = 4,
              align = F,
              family = "times") +
  xlim(-1, 16) +
  geom_rootedge(rootedge = 1) +
  # geom_nodelab(aes(x = branch, label = label),
  #              vjust = -.5,
  #              size = 3) +
  geom_point(
    data = all_data_df_3,
    aes(x = x, y = y, fill = boot_bin),
    shape = 21,
    size = 2.5,
    stroke = 0.2,
    color = "black"  # optional border for visibility
  ) +
  scale_fill_manual(values = bin_colors, 
                    name = "Bootstrap %") +
  ## Europe N130 Colouring
  geom_cladelab(
    node = 130,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 3.3,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 130,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## Asia N151 Colouring
  geom_cladelab(
    node = 151,
    label = "Asia",
    fontsize = 6,
    align = F,
    offset = 3.7,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 151,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## Europe N149 Colouring
  geom_cladelab(
    node = 149,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 3.7,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 149,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## North America N85 Colouring
  geom_cladelab(
    node = 85,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 4,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 85,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## North America N107 Colouring
  geom_cladelab(
    node = 107,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 4,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 107,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  )
```


### Greater 90% Orthologs

Number of species: 80  
Number of SCO: 370  
Concatenation length: 239,864 bp  

```{r importing g90 three root tree, include = F}
read.newick(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/asm_pipeline/raxml_g90/RAxML_bipartitions.tubme_scog90_100bs_3root") -> all_data_3
all_data_3$edge.length <- all_data_3$edge.length * 120
```

```{r annotating tree data g90 three root, include = F}
all_data_3 %>% 
  as_tibble() %>% 
  filter(str_detect(label, "^[A-Za-z]")) %>% 
  dplyr::select(label) -> samples_all

all_data_3 %>% 
  as_tibble() %>%
  full_join(tree_md_gp %>% 
              dplyr::filter(sample %in% samples_all$label), 
            join_by(label == sample)) %>%
  as.treedata() -> all_data_3
```

```{r g90 three root making the node shapes and colours, include = F}
all_data_df_3 <- fortify(all_data_3)

all_data_df_3 <- all_data_df_3 %>%
  filter(!isTip & !is.na(label)) %>% 
  mutate(bootstrap = as.numeric(label)) %>%
  filter(bootstrap > 0) %>% 
  mutate(
    boot_bin = case_when(
      bootstrap >= 91 ~ "91–100",
      bootstrap >= 81 ~ "81–90",
      bootstrap >= 71 ~ "71–80",
      bootstrap >= 61 ~ "61–70",
      bootstrap >= 51 ~ "51–60",
      bootstrap >= 41 ~ "41–50",
      bootstrap >= 31 ~ "31–40",
      bootstrap >= 21 ~ "21–30",
      bootstrap >= 11 ~ "11–20",
      TRUE ~ "0–10"
    )
  )
# View(all_data_df_3)

bin_colors <- c(
  "91–100" = "grey60",
  "81–90" = "darkblue",
  "71–80" = "blue",
  "61–70" = "dodgerblue3",
  "51–60" = "skyblue3",
  "41–50" = "chartreuse4",
  "31–40" = "goldenrod2",
  "21–30" = "orange",
  "11–20" = "orangered2",
  "0–10"   = "red"
)
```

```{r g90 three root getting nodelabs for better plot, fig.width = 15, fig.height = 15, include = F}
ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = Continent, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)

ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = label, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)

ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = Location, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)
```

```{r making samples bold that are in weird places g90 three root, include = F}
highlight_samples <- c("jt13224", "am1126", "rh847", "mes1418")

all_data_3 %>%
  as_tibble() %>%
  mutate(highlight = label %in% highlight_samples) %>%
  as.treedata() -> all_data_3
```

```{r plotting the g90 three root tree, fig.width = 15, fig.height = 17, echo = F, results='hide'}
ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = tiplabel,
                  fontface = ifelse(highlight, "bold.italic", "italic")),
              size = 4,
              align = F,
              family = "times") +
  xlim(-1, 16) +
  geom_rootedge(rootedge = 1) +
  # geom_nodelab(aes(x = branch, label = label),
  #              vjust = -.5,
  #              size = 3) +
  geom_point(
    data = all_data_df_3,
    aes(x = x, y = y, fill = boot_bin),
    shape = 21,
    size = 2.5,
    stroke = 0.2,
    color = "black"  # optional border for visibility
  ) +
  scale_fill_manual(values = bin_colors, 
                    name = "Bootstrap %") +
  ## Europe N143 Colouring
  geom_cladelab(
    node = 143,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 3.3,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 143,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## Asia N133 Colouring
  geom_cladelab(
    node = 133,
    label = "Asia",
    fontsize = 6,
    align = F,
    offset = 3.7,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 133,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## Europe N140 Colouring
  geom_cladelab(
    node = 140,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 3.7,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 140,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## Europe N131 Colouring
  geom_cladelab(
    node = 131,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 3.7,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 131,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## North America N107 Colouring
  geom_cladelab(
    node = 107,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 4,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 107,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## North America N85 Colouring
  geom_cladelab(
    node = 85,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 4,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 85,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  )
```


### Greater 85% Orthologs

Number of species: 80  
Number of SCO: 387   
Concatenation length: 250,299 bp  
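
For comparing the four re-done runs side by side, a small sketch that tabulates the numbers reported in these sections and adds the mean alignment length per SCO. The column names are my own; the values are copied from the headers above.

```r
# SCO counts and concatenation lengths as reported for each run
sco_summary <- data.frame(
  dataset   = c("All SCO", "Greater 95%", "Greater 90%", "Greater 85%"),
  n_sco     = c(69, 310, 370, 387),
  concat_bp = c(42911, 202736, 239864, 250299)
)

# mean concatenated length contributed by each SCO
sco_summary$mean_bp_per_sco <- round(sco_summary$concat_bp / sco_summary$n_sco, 1)

print(sco_summary)
```

The per-SCO length stays in the same range across runs, so the longer concatenations come from adding more orthologs rather than from longer ones.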

```{r importing g85 three root tree, include = F}
read.newick(file = "/Users/benyoung/Library/CloudStorage/OneDrive-UCB-O365/projects/arthur_tubme/asm_pipeline/raxml_g85/RAxML_bipartitions.tubme_scog85_100bs_3root") -> all_data_3
all_data_3$edge.length <- all_data_3$edge.length * 150
```

```{r annotating tree data g85 three root, include = F}
all_data_3 %>% 
  as_tibble() %>% 
  filter(str_detect(label, "^[A-Za-z]")) %>% 
  dplyr::select(label) -> samples_all

all_data_3 %>% 
  as_tibble() %>%
  full_join(tree_md_gp %>% 
              dplyr::filter(sample %in% samples_all$label), 
            join_by(label == sample)) %>%
  as.treedata() -> all_data_3
```

```{r g85 three root making the node shapes and colours, include = F}
all_data_df_3 <- fortify(all_data_3)

all_data_df_3 <- all_data_df_3 %>%
  filter(!isTip & !is.na(label)) %>% 
  mutate(bootstrap = as.numeric(label)) %>%
  filter(bootstrap > 0) %>% 
  mutate(
    boot_bin = case_when(
      bootstrap >= 91 ~ "91–100",
      bootstrap >= 81 ~ "81–90",
      bootstrap >= 71 ~ "71–80",
      bootstrap >= 61 ~ "61–70",
      bootstrap >= 51 ~ "51–60",
      bootstrap >= 41 ~ "41–50",
      bootstrap >= 31 ~ "31–40",
      bootstrap >= 21 ~ "21–30",
      bootstrap >= 11 ~ "11–20",
      TRUE ~ "0–10"
    )
  )
# View(all_data_df_3)

bin_colors <- c(
  "91–100" = "grey60",
  "81–90" = "darkblue",
  "71–80" = "blue",
  "61–70" = "dodgerblue3",
  "51–60" = "skyblue3",
  "41–50" = "chartreuse4",
  "31–40" = "goldenrod2",
  "21–30" = "orange",
  "11–20" = "orangered2",
  "0–10"   = "red"
)
```

```{r g85 three root getting nodelabs for better plot, fig.width = 15, fig.height = 15, include = F}
ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = Continent, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)

ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = label, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)

ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = Location, 
                  fontface = "italic"),
              size = 3,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  geom_nodelab(aes(x = branch, label = node),
               size = 4)
```

```{r making samples bold that are in weird places g85 three root, include = F}
highlight_samples <- c("jt13224", "am1126", "rh847", "mes1418")

all_data_3 %>%
  as_tibble() %>%
  mutate(highlight = label %in% highlight_samples) %>%
  as.treedata() -> all_data_3
```

```{r plotting the g85 three root tree, fig.width = 15, fig.height = 17, echo = F, results='hide'}
ggtree(all_data_3, 
       ladderize = T, 
       size = 0.5) +
  geom_tiplab(aes(label = tiplabel,
                  fontface = ifelse(highlight, "bold.italic", "italic")),
              size = 4,
              align = F,
              family = "times") +
  xlim(-1, 18) +
  geom_rootedge(rootedge = 1) +
  # geom_nodelab(aes(x = branch, label = label),
  #              vjust = -.5,
  #              size = 3) +
  geom_point(
    data = all_data_df_3,
    aes(x = x, y = y, fill = boot_bin),
    shape = 21,
    size = 2.5,
    stroke = 0.2,
    color = "black"  # optional border for visibility
  ) +
  scale_fill_manual(values = bin_colors, 
                    name = "Bootstrap %") +
  ## Europe N122 Colouring
  geom_cladelab(
    node = 122,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 3.3,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 122,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## Asia N112 Colouring
  geom_cladelab(
    node = 112,
    label = "Asia",
    fontsize = 6,
    align = F,
    offset = 3.7,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 112,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## Europe N119 Colouring
  geom_cladelab(
    node = 119,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 3.7,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 119,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## Europe N110 Colouring
  geom_cladelab(
    node = 110,
    label = "Europe",
    fontsize = 6,
    align = F,
    offset = 3.7,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 110,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## North America N86 Colouring
  geom_cladelab(
    node = 86,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 4,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 86,
    fill = "grey60",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  ) +
  ## North America N137 Colouring
  geom_cladelab(
    node = 137,
    label = "North America",
    fontsize = 6,
    align = F,
    offset = 4,
    offset.text = 0.05,
    textcolor = 'black',
    barcolor = 'darkgrey',
    family = "times"
  ) +
  geom_hilight(
    node = 137,
    fill = "steelblue2",
    type = "gradient",
    gradient.direction = 'rt',
    alpha = .3,
    to.bottom = T
  )
```
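
The label and highlight pairs in the block above repeat the same arguments for every clade, so one option is to keep the per-clade values in a small table and generate the layers from it. This is only a sketch, assuming `ggtree` is installed. `clades` and `clade_layers()` are hypothetical names, and only the five clades whose settings are fully shown above (nodes 112, 119, 110, 86 and 137) are listed.

```r
# Per-clade settings copied from the geom_cladelab()/geom_hilight() calls above.
clades <- data.frame(
  node   = c(112, 119, 110, 86, 137),
  label  = c("Asia", "Europe", "Europe", "North America", "North America"),
  offset = c(3.7, 3.7, 3.7, 4, 4),
  fill   = c("grey60", "steelblue2", "steelblue2", "grey60", "steelblue2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns a flat list of ggtree layers, one clade label
# and one highlight per row of `clades`. ggplot2 accepts a list of layers on
# the right-hand side of `+`.
clade_layers <- function(clades) {
  per_clade <- lapply(seq_len(nrow(clades)), function(i) {
    list(
      ggtree::geom_cladelab(
        node = clades$node[i], label = clades$label[i],
        fontsize = 6, align = FALSE,
        offset = clades$offset[i], offset.text = 0.05,
        textcolor = "black", barcolor = "darkgrey", family = "times"
      ),
      ggtree::geom_hilight(
        node = clades$node[i], fill = clades$fill[i],
        type = "gradient", gradient.direction = "rt",
        alpha = 0.3, to.bottom = TRUE
      )
    )
  })
  # Flatten list(list(label, highlight), ...) into list(label, highlight, ...).
  do.call(c, per_clade)
}

# Usage, where p is an existing ggtree plot:
# p + clade_layers(clades)
```

Adding or recolouring a clade then means editing one row of `clades` rather than two function calls.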


## Ben To Do 

So I think some of the following steps should be done. 

- Reassemble the data. The assemblers have been updated with better algorithms since the original run, so this may give better assemblies.
- Prior to assembly, run the trimmed reads through a bacterial and viral BLAST screen and remove hits based on bit score.  
- Run BUSCO on all assemblies to get some assembly stats. Better than just assembling and throwing everything into a downstream analysis.  
- Any other ideas.  

 
## Questions and Answers from Arthur - 22 July 2025 

1. What method was used for library prep and sequencing?
- Modified CTAB phenol chloroform protocol.  
- RNA baits to hybridize areas of interest.  
- Magnetic bead stand to pull down and do washes, then library prep.  
- Nextera DNA Flex library prep (low concs).  
- MiSeq run for the raw reads (PE150 from `fastqc` report).  

2. What was the original assembly command with SPAdes?
- It is in `/pl/active/fungi1/argr6723_2-15-21/Rufum_TC/Assemblies_SPAdes`.  
- Ran Trimmomatic as well.  

2. Where are the raw reads?
- No raw reads anymore :,(.  
- This archive has the trimmed reads - `/pl/active/fungi1/argr6723_2-15-21/Rufum_TC/archive.tar`.  

3. What did you sequence: pure cultures, dirty samples, etc.?  
- Dried museum herbarium samples, 1 year to 110 years old.  

4. If dirty samples, did you do a prokaryotic/viral/metazoan cleaning before assembly?
- No cleaning step, just the metagenome flag.  
- Run a custom prokaryotic and viral BLAST analysis prior to assembly.  
- Tuly genome, 110/114 years IMPORTANT (and has good orthogroups, great success).  
- Reassemble with updated programs.  

5. From the data we have, there are species with essentially no orthologs (see Excel). 
- Alisha said to work down pairwise for orthologs.  
- Build a tree at every 5% step of species coverage for the orthologs, from 90% down to 50% (that's nine trees).  
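
The nine cutoffs can be worked out rather than typed by hand. A minimal sketch, assuming 80 species in the full run (inferred from the cutoffs listed earlier, not stated outright) and 68 after removing the low-orthogroup species, and assuming the species count is rounded to the nearest whole number. `species_cutoffs()` is a hypothetical helper name.

```r
# Minimum number of species an ortholog must be present in at each
# coverage threshold, from 90% down to 50% in 5% steps (nine thresholds).
# Assumption: counts are rounded to the nearest whole species.
species_cutoffs <- function(n_species, pct = seq(90, 50, by = -5)) {
  data.frame(
    pct         = pct,
    min_species = round(pct / 100 * n_species)
  )
}

species_cutoffs(80)  # all samples (assumed total of 80)
species_cutoffs(68)  # after removing the low-orthogroup species
```

With these assumptions the output matches the species counts quoted next to each percentage in the SCO lists above.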

6. If I make a spreadsheet with all the names (i.e. ag213, jf1136 etc.), can you fill in the column information for me?
- Arthur will do yayyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy.  

7. Things to remove (i.e. rerun everything from protortho onwards with these removed). 
- rh1006.  
- rh722.  
- g4740.  
- Tubme1v2.  
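
A small sketch of applying this removal list to a sample sheet before the rerun. The `all_samples` vector is a hypothetical stand-in for the real sample list (it only contains IDs mentioned in these notes), and the check for unknown IDs is an added safeguard rather than part of the original workflow.

```r
# Samples flagged for removal in the list above.
to_remove <- c("rh1006", "rh722", "g4740", "Tubme1v2")

# Hypothetical sample list; in practice this would be every sample ID.
all_samples <- c("ag213", "jf1136", "rh1988", "ag342", "rh721",
                 "rh1006", "rh722", "g4740", "Tubme1v2")

# Catch typos: every ID to remove should exist in the sample list.
missing_ids <- setdiff(to_remove, all_samples)
if (length(missing_ids) > 0) {
  stop("Not in sample list: ", paste(missing_ids, collapse = ", "))
}

# Samples that go into the rerun.
keep <- setdiff(all_samples, to_remove)
keep
```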

8. Which species will the tree be rooted on?
- rh1988.  
- ag342.  
- rh721.  
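
A hedged sketch of rooting on these samples with the `ape` package. Using `ape` here is an assumption, since the notes do not say which tool does the rooting, and `root_on_outgroup()` is a hypothetical helper. It keeps only the outgroup tips that are actually in the tree and stops if they do not form a single clade, because `ape::root()` needs a monophyletic outgroup.

```r
# Outgroup samples from the list above.
outgroup <- c("rh1988", "ag342", "rh721")

# Hypothetical helper: root a phylo object on whichever outgroup tips it has.
root_on_outgroup <- function(tree, outgroup) {
  present <- intersect(outgroup, tree$tip.label)
  if (length(present) == 0) {
    stop("No outgroup tips found in the tree")
  }
  # A multi-tip outgroup must form one clade for ape::root() to accept it.
  if (length(present) > 1 && !ape::is.monophyletic(tree, present)) {
    stop("Outgroup tips are not monophyletic: ",
         paste(present, collapse = ", "))
  }
  ape::root(tree, outgroup = present, resolve.root = TRUE)
}

# Usage with a toy Newick string (made-up topology, not real data):
# tree   <- ape::read.tree(text = "((rh1988,ag342),(a,b),(c,d));")
# rooted <- root_on_outgroup(tree, outgroup)
```

If the monophyly check fails, that is worth looking at before forcing a root, given that the >80% tree could not resolve the outgroup relationships.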

